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Abstract 

We combine a high order compact finite difference approximation and collocation 
techniques to numerically solve the two dimensional heat equation. The resulting method 
is implicit and can be parallelized with a strategy that allows parallelization across both 
time and space. We compare the parallel implementation of the new method with a 
classical implicit method, namely the Crank-Nicolson method, where the parallelization 
is done across space only. Numerical experiments are carried out on the SGI Origin 2000. 
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1 Introduction 

In [6], Jezequel combined the second-order standard finite difference approximation for the 
spatial derivative and collocation technique for the time component to numerically solve the 
one dimensional heat equation. The method, called implicit collocation method (TCM), is 
unconditionally stable. Its principle is as follows: after discretization in space of the problem, 
the solution is approximated at each spatial grid point by a polynomial depending on time. 
The resulting derivation produces a linear system of equations. The order of the method 
is in space the order of difference approximation and in time the degree of the polynomial 
[2, 4, 5]. I CM when implemented on parallel computers, allows the parallelization across 
both time and space. Numerical experiments carried out by Jezequel on the Cray T3E, 
show that the proposed ICM algorithm can achieve acceptable efficiency and under some 
simple assumptions, performs better (computing time) than standard explicit finite difference 
method [6]. 

Recently, we applied TCM to the two dimensional heat equation. For the spatial dis- 
cretization, we used instead a fourth-order compact scheme that was shown to be more 
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accurate than the; second-order one. Our numerical results, performed on a serial computers, 
showed again the stability of TCM [7]. 

Unfortunately, the previous analyses do not include a comparison between TCM and 
other classical implicit techniques to solve the heat equation. Though in TCM we can compute 
(utilizing a unique system of equation) the approximated solution at any point in a given time 
interval, its use introduces an additional dimension (degree of polynomials) to the problem. 
This increases the size of the system of equations to be solved, creates a lot of memory 
requirement and a limitation on the number of spatial grid point to be employed. Though 
TCM offers the parallelization across both time and space, it is not clear that it is more 
attractive, in terms of accuracy and parallel efficiency, compared to other implicit methods 
where the parallelization is done across space only. 

Tn this paper, we compare TCM and a well known implicit technique, namely Crank- 
Nicolson method, in the numerical solution of the two dimensional heat equation when a 
fourth-order spacial discretization is employed. We use similar algorithms to implement both 
methods on a shared memory parallel computer. We estimate the memory and computational 
requirements of each method and determine the the set of conditions for which one method 
is better than the other one. Tn our numerical experiments, we present the accuracy of the 
approximated solution and the parallel efficiency of each method. 

An outline of the paper is as follows. Section 2 presents the two different types of 
discretization. Section 3 discusses some computational considerations needed to implement 
the two methods. Section 4 describes the algorithms. Numerical experiments appear in 
Section 5. We formulate some conclusion in Section 6 


2 Derivation of the System of Equations 

We consider the two dimensional heat equation: 

dU , . ry / 8 2 U , . 

-^(x,y,t) = a \-^{x,y,t) + -^{x,yJ.)J, {x,y,t) 6 fi x [0, oo) (1) 

where fi = [0, 1] x [0, 1], and with the initial condition 

u(x, y, 0) = y), (x, y) £ ft, 

and the boundary conditions 

u(0,yj) = u{\,y,t) = u(x,CU) = yo(-M), and n(x, 1,0 = yi(x, t) for t> 0. 

We assume that /o, / 1, go and g\ are smooth functions in the variable £, i.e., their first 
derivatives with respect to t exist and are continuous. 

Let h = 1 jn and At be the uniform spatial and time mesh-widths respectively. We can 
subdivide the spatial domain and consider the time step as follows: 

Xi-ih, Vj = jK M = 0, 

tk = k At. k — 0, 1, . . . 
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For simplicity we write the approximated solution ol u and its time derivative at the spatial 
grid points (x it i/ j) as: 

U tJ (/.) = u(x t , yj , t), and U'j (/.) - — (x, , Vj , t) . 

At any given time t , if we use the discretization of the steady state Poisson equation 
with a fourth-order scheme [3], we can approximate the spatial derivatives of (1). We obtain 
for any grid point (x;, jjj) , i, j = 1, . ,n — l: 

\ [o? + ,j(i) + v!j„w + uu s (t) + u'j.m + sc'jit)] 

+Ui+\.j+\{t) + I7j-ij+i(0 + + Ui+\,j-i{t) - 20Vij[t)], 


with the conditions 


Uij{0) = ), 

u 0 ,j{t) = fo(yj,t), 

^t,o(0 = 9o(Xu t), 
Ui,n{t ) = t), 


u nj(Q = ^r(yj’ t )’ 

Ko (t) = ^[x it t). 


Eq. 2 is a system of (n - l) 2 ordinary differential equations and for any value of t, it 
is fourth-order in space. In the next two sections, we introduce two methods for discretizing 
the time derivative. 


2.1 Crank-Nicolson Method 


Let Ufj be the appoximation of u(x,y,t) at the spatial grid point ( X{,yj ) and the time level 
t k = kAt. Using (2), if we apply the Crank-Nicolson derivation [lj, we get 


1 + + Cf-i) 

+ (1 - 4 + UfXj + U?\ + (8 + 20p) U-j 1 

. I. 1. - r U r 1* \ 


Av? + lJ+ 


i + uf-ij+i 


+ U k -ij-i + uf +l j_ t ) 


+(1 + 4 p)(uL,i + u£ i+ , + uU.i + u ti- 1 ) + ( 8 - 2 °p) u tj 


(3) 


where /> = ^At. After some simple manipulations, we obtain the linear system of equations 

BU k+l = FU k + R (4) 

where U k+l and U k are vectors with components U k j 1 and U/j, i, j = 1, ■■■,n - 1, re- 
spectively, R is a (u — l) 2 dimensional vector containing the values of \J and b ' at the 
boundary of the domain, and B and F are matrices of order (n - l) 2 given by: 

B = tri[B- 1 , Bq, Bi] ri _i, F = t.ri[F-i, Fq. F[] n _i 
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where 

/?_! = tri[-(>, 1 - \p, F-\ ~ tri\p, 1 + \p, p] n -i, 

B\) — 4ri[l — 4/>, 8 + 20/>, l — 4p] n _i, E; = 4r*[l + 4^, S — 20p, l 4- 4 p] n _j, 

B x = tri[—p, 1 — 4p, — p] n -i, E t = 2ri[/>, 1 + 4p ; p] n -i- 

The subscript n — 1 is the number of rows. 

The Crank-Nicolson method (CNM) is unconditionally stable and is of fourth order in 
space and second order in time. 

2.2 Implicit Collocation Method 

Let Pi,j{t) be the polynomial of degree r satisfying the system (2) at the spatial grid point 
{xi,yj) and at times tk = kAt (k = 0, ...,r — 1). Then for any i,j = - 1 and 

fc = 0, . . . , r — 1, we have 

Pi,j{tk) — Qi,j,rtk "h ®ij,r— d" * * * + T* 

The coefficients dujfl are determined from the initial condition: 

o*,i,0 = Pij{ 0) = = xp{xi,yj). 

To solve the system (2) by the collocation method is to determine the coefficients a,*^ i, . . . , a Z j ?r , 
for i,j — - 1. After some algebraic manipulations (see [6, 7] for details) we obtain 

the linear system of r(n — l) 2 equations 


ax = s ; 


( 5 ) 


where A is a block-tridiagonal matrix given by 


A — tri [Aj_i, A/, A/ + i] h _ t . 


A/_x, A/ and A /+ 1 are square matrices (of order r(n — 1)) defined as 


A/_! 

Ai.- 
A /+ 1 


= tri 


= tri 


= tri 


-E l^E'-4E,-E 

2 or 


n— 1 


i^£? , -4E,4^£J , + 20E ) ~S , -4E 

2 or or 2 or 


41-1 


-E t^E'-4E,-E 

2 a- 


41-1 


The subscript n — 1 determines the number of block-rows. E and E f are nonsymmetric 
matrices of order r 


E = 

( AT iT- 1 

t r / r_l - 

l i n 

*o > 

Zi 

E' — 

nr' ('■ - D<5' 2 ' 

n;- 1 (r-i)(;- 2 • 

• 2i 0 1 \ 

• 2tx 1 


L / ^ / r “ 1 

\ V- 1 l r- 1 

" ^-i / 


l<:{ (r-l)C? ■ 

• 2i r -i 1 / 
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the vector X has the r(» - l)' 2 unknowns ■ • ■ > for l 'J — aiu * tho 

right, hand side 5 is explicitly given in [7]. 

The implicit collocation method (TCM) introduced here is of order four in space and of 
order r in time. To solve the heat equation by TCM, we first determine the coefficient a lJJe , 
and then compute the values of the polynomials at t = t k . Numerical results presented 

in [7] show that TCM is unconditionally stable and produces high accurate solutions. 


3 Computational Considerations 

To determine the approximated solution of the heat equation using CNM, we need to solve 
the linear system (4) where B is a matrix of order (n - l) 2 and bandwidth 2 n + 1. If we 
use ICM, we first solve the system (5) to obtain the coefficients a iJik ( i,j = 1,. . . ,n - 1 and 
k = 1, . . . ,r). Then the approximated values Ujj{t k ) {i,j = 1, . . . ,n — 1 and k = 1, . ,r — 1) 
are calculated using the polynomials Pi,j{t). Here, the matrix A is of order r(n — 1) _ and its 
bandwidth is equal to (2 n + l)r. The matrix A is r 2 times as large as B. 

For a given r, if we were to compare CNM (evaluated r consecutive times) and TCM 
using the same At, it is obvious that the implementation of CNM would be faster since 
the decompostion of the matrix A would be by far more computational demanding. The 
advantage of ICM is that it does not consist of determining the U hj {t k ) only (fc = 1, . . . , r- 1), 
but also allows us to find the approximated solution at any t in the interval [iolr-l]- For a 
“fair” comparison, we can choose different At for the two methods as follows. 

Let Ncnm — (n - l) 2 be the number of unknowns in (4) and let N icm = r(n - 1) the 
one in (5) where r is the degree of the polynomials. Let At irm be the time step used for TCM. 
The solution using ICM, can be determined at any point in the interval [0, (r — l)Aftcm]- 
(r— l)m be the number of equidistant points where the solution is to be computed in this time 
interval. To determine the solution at the same points of the interval with CNM, (r - l)m 
time iterations must be carried out. The corresponding time step is Af mm = At icm /rn. 

Given the above assumptions, we want to numerically determine the values of m for 
which ICM is cheaper than CNM for r and At icm fixed. In fact, we will compare the two 
methods over the interval [0,T] (where T = r?(r - l)A< icm ). In this case, both methods will 
be iterated until we reach the approximated solution at t — T. We assume ICM and CNM 
produce solutions of comparable accuracies. 

4 Description of the Parallel Implementation 

With the considerations laid out in the previous section, rj consective iterations of ICM will 
be carried out to obtain the solution over the time interval [0, T] (T = rj{r — l)A<j cm ). At 
each iteration of ICM, the solution will be approximated at (r - l)m time levels. If we want 
to determine the solution over the same interval [0. T] using CNM (A t cnm = Aticm/m ), CNM 
will be called (r — 1 ) 777/1 consecutive times. 

For n, r, A t icm , r) and m given, we can define the algorithms for both methods as 
follows: 

ICM’s Algorithm 

1. Define the matrix A 

2. Decompose the matrix A to obtain the matrix M ~ A~ l 
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3. For l = l — > r/, do: 

4. Defino the right hand side 5 

5. Determine the coefficients a t j^: MAX = MS 

G. Find Uij(tf.) at (r — l)m consecutive time steps by computing Pij(t^) 

7. End do 

CNM’s Algorithm 

1. Define the matrix £? 

2. Decompose the matrix B to obtain the matrix Q B~ l ) 

3. For l = 1 — > 77i(r — l)/}, do: 

4. Define the right hand side S = FU l ~ l + R 

5. Find the solution: QBU 1 = QS 
7. End do 

We present two strategies for implementing the two algorithms. We focus on ICM’s 
Algorithm with the assumption that the same techniques will be utilized for the one of CNM. 

Strategy 1: 

To solve the linear system of equations, we used the decomposition algorithm for inverting 
asymmetric and indefinite matrices proposed by Luo [8]. Here, the inverse of the matrix A 
is explicitly computed, i.e., M — (Step 2). It follows that Step 5 (X = MS) is just a 
matrix vector multiplication. Given the polynomial coefficients a t j^ (obtained in Step 5), 
Step 6 computes Pij{tk) using Horner’s algorithm. 

Strategy 2: 

In this case, a sparse matrix M % A~ l is explicitly computed and used as a preconditioner for 
the sparse matrix A (Step 2). A method based on Luo’s algorithm is employed by dropping 
(in the calculations of M) all the entries that are less than (in absolute value) a prescribed 
tolerance r > 0 [9]. To carry out Step 5 (MAX = MS ), the General Minimal Residual 
(GMRES) algorithm is used as the iterative accelerator. Finally Step 6 is the same as in the 
previous strategy. 

To avoid the storage of zero elements, the nonzero entries of the matrix A are stored by rows 
with the compressed row format. 

In Strategy i, no data compression is used. In the case of of ICM, it limits the number 
of spatial grid points we can choose. The decomposition of matrix A (ICM) will cost at least 
r 3 times more than the one of B (CNM). 

The fact to obtain the solution at (r — l)m consecutive time steps during one iteration 
of the ICM's algorithm shows that with ICM, the parallelization is achieved across both time 
and space. 

In [7], we observed that the choice of a small value of r (say r = 3 or r = 4) can be 
appropriate to obtain accurate solutions in a short amount of time. Such a choice limits the 
size of the matrix .4 which decomposition is the most demanding computation and the most 
difficult to parallelize. Since Steps 1 and 2 are done once, more time will be spent on right 
hand side updates (Step I), matrix-vector multiplications or iterative process (Step 5) and 
basic loop calculations (Step G) until the desired time step is reached. All these three steps 
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can easily be implemented on parallel computers and the efficiency or the algorithm will be 
improved. In the numerical experiments, we consider r — 3. 


5 Numerical Experiments 

The numerical experiments were conducted on a SGI Origin 2000 with 24 processors each 
running at 250 MHz and 512 MB memory. The programs were coded in Fortran 90 program- 
ming language in double precision with 64-bit arithmetic. The parallel implementation of the 
two algorithms was achieved by introducing parallel directives (thread commands) in all the 

major loops. 

For all our simulations, we consider Eq. 1 with the conditions 

a = — , y, 0) = sin nx + sin ny, 

7T 

/ 0 (y, t) = fi{y, t) = e~* sin iry, g 0 {x,t) = g\{x,t) = e~ l smnx. 

The exact solution is given by u(x, y, t) = (sin nx + sin7rj/)e 

To simplify our analysis, we take for all the experiments n = 32, r = 3, A t tcm — 0.01, 

For different values of m, we report the elapsed times obtained with both ICM and CNM 
when 4 processors are employed. The results appear in Figure 1 and Figure 2 foi Strategy 
1 and Strategy 2 (with the dropping parameter r = 10~ 5 ) respectively. We observe that for 
m > 80, ICM with Strategy 1 is more cost effective. If Strategy 2 is instead used, we need 
only > 10 to have ICM better than CNM. 

In addition, for ICM, Strategy 1 requires more time to reach the target time step. In fact, the 
factorization of the matrix A takes most of the computing time. We did not take advantage 
of the bandwidth of the matrix A. If we did, we would have reduced the factorization time 
at the expense of the parallel efficiency [9]. 




I L — 1 i 1 i 1 1 1 * 

10 20 30 40 50 60 70 30 90 100 


Figure 1: Strategy 1: elapsed time as func- 
tion of rn 


Figure 2: Strategy 2: elapsed time; as func- 
tion of m. 
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For m = 15, we present the speedup as function of the number of processors when TCM 
and CNM are both implemented with Strategy 1 and Strategy 2. 

From Figure 3, we note that when Strategy 1 is used, CNM scales very well across the 
processors whereas TCM is less efficient. If we examine the computing time required by each 
component of the algorithm, we see that for CNM, at most 7% of the total time is spent on 
the matrix decomposition. In ICM, the same decomposition takes at least 80% of the time. 
The Luo’s matrix decomposition has a moderate efficiency. This explains why ICM does not 
scale that well. 

On the other hand, if we focus on Strategy 2, the matrix factorization (done sequentially) 
takes a small percentage of the time (less than 13%) for both ICM and CNM (due in part 
for the data compression). However, ICM displays a better efficiency (Figure 4). This is due 
to the fact that ICM requires a larger problem size and therefore a better load balancing is 
achieved on the remaining part of the code (at least 87% of the total time) that is easy to 
parallelize. 


Speadtp - Strategy 1 




1 2 3 4 5 6 7 8 

Number of processors 


Figure 3: Strategy 1: speedup as function Figure 4: Strategy 2: speedup as function 

of the number of processors for m = 15. of the number of processors for m = 15. 

Finally, we observe that the approximate solutions were slightly more accurate with 
CNM for Strategy l and with ICM for Strategy 2. The maximum errors obtained were not 
significantly different to draw definitive conclusions. 


6 Conclusions 

We have compared the Crank-Nicolson method (CNM) and the implicit collocation method 
(ICM) in the numerical solution of the two dimensional heat equation. In both methods, the 
spatial derivatives are discretized with a high order finite difference scheme and the resulting 
approximations give linear systems of equations. We presented two parallel strategies to 
implement ICM and CNM on the SGI Origin 2000. Our numerical experiments showed that 
under some simple assumptions, ICM can require less elapsed time than CNM and can be 
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more efficient,. The larger number of unknowns in ICM can allow a better load balancing but 
unfortunately does not allow us to solve large size problems whereas CNM does. 

In our codes, we used parallel directives on all the major Do-loops. The matrix decom- 
positions used here were either done sequentially or did not display good efficiency. In future 
work, we plan to implement CNM with message passing directive and to explore parallel 
matrix decomposition algorithms. 
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